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Abstract 

In this article, we advocate the ensemble approach for variable selection. We point out 
that the stochastic mechanism used to generate the variable-selection ensemble (VSE) must be 
picked with care. We construct a VSE using a stochastic stepwise algorithm, and compare its 
performance with numerous state-of-the-art algorithms. 
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1 Introduction 



The ense mble approach for statistical modellinK was fir st made p opular by such algorithms as 
boosting ( Freund and Schapirelll996t [Friedman et al. 12000). bagging (|Breimanlll996l) , random forest 



( Breiman 



20011 ). and the gradient boosting machine 



Friedman 



20011 ) . They are powerful algorithms 



for solving prediction problems. This article is concerned with using the ensemble approach for a 
different problem, variable selection. We shall use the terms "prediction ensemble" and "variable- 
selection ensemble" to differentiate ensembles used for these different purposes. 

1.1 Variable-selection ensembles (VSEs) 

First, we give a general description of variable-selection ensembles (VSEs). Suppose there are p 
candidate variables. A VSE (of size B) can be represented hy a. B x p matrix, say E, whose j-th 
column contains B independent measures of how important variable j is. Let E(&, j) denote the 
{b, j)-th. entry of E. Using the ensemble E as a whole, one typically ranks the importance of variable 
j using a majority- vote type of summary, such as 



J) 



(1) 



b=l 



and the variables that are ranked "considerably higher" than the rest are then selected. 

The key for generating a VSE lies in producing multiple measures of importance for each candi- 
date variable. By contrast, traditional variable selection procedures, including stepwise selection and 
Lasso, typically produce just one such measure, that is, i? = 1. It shouldn't be hard for any statis- 
tician to appreciate that averaging over a number of independent measures is often beneficial. This 
is the main reason why VSEs are attractive and more powerful than many traditional approaches. 

To make selection decisions, one must be more precise about what it means to say that some 
variables are ranked "considerably higher" than the rest. One option is to select variable j if it is 
ranked "above average," i.e., if 



m > 



p 



E 

fe=i 



R{k). 



(2) 



This is what we use in all the experiments reported below, but we emphasize that other thresholding 
rules can be used as well. For example, one can make a so-called "scree plot" of R{1), -R(2), Rjp) 
and look for an "elbow" — a very common practice in principal component analysis (e.g. 



Jolliffe 



20021 ). but the precise location of the "elbow" is highly subjective, which is why we choose not to 
use this strategy here. 

The distinction between ranking and thresholding is particularly important for VSEs; we will 
say more about this in Section l4Tl 
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1.2 PGA 

Zhu and Chipmaij (|2006l ) constructed a VSE using a so-called parallel genetic algorithm (PGA). 
To produce multiple measures of variable importance, PGA repeatedly perfo rms stochastic rather 
than deterministic optimization of the Akaike Information Criterion (AIC; lAkaikd 119731) . while 



deliberately stopping each optimization path prematurely. In practice, one must be more exact 
about what "premature" means, but we won't go into the specifics here and refer the readers to 
Zhu and ChipmanI (|2006l Section 3.2) for details. 

The main idea is as follows. Early termination forces each optimization path to produce sub- 
optimal rather than optimal solutions, while the use of stochastic rather than det erministic opti- 
mization allows each of these sub-optimal solutions to be different from each other (|Zhull2008() . For 
example, suppose we have five candidate variables, Xi, a;2, 0:5. The first time we stochastically 
optimize the AIC for just a few steps, we may arrive at the solution {xi, 2:2, 2:3}; the second time, 
we may arrive at {xi, X2, x^}; and the third time, perhaps {xi,X2,xz}. This produces the following 
ensemble: 



E : 







Since 



i?(l) = i?(2) = 1 > 3 = fi(3) = i?(4) = i?(5). 



the ensemble selects | xi , X2 



Zhu and ChipmanI (|2006|) used the genetic algorithm (GA; IColdbergl 119891 ) as their stochastic 



optimizer in each path, but our general description of VSEs above (Section ll.ip makes it clear that 
any other stochastic optimizer can be used for PGA to work, despite the name "PGA." This is a 
crucial point, and we will come back to it later (Sections 11.31 and 14.21) . 

Though driven by the AIC, it has been observed that PGA has a much higher probability of 
selecting the correct subset of variables than optimizing the AIC by exhaustive search — of course, 
such observations have only been made on mid-sized simulation problems where an exhaustive search 
is feasible and the correct subset is known. Nonetheless, they show quite conclusively that PGA is 
not merely a better search algorithm, because one cannot possibly perform a better search than an 
exhaustive one. Therefore, PGA can be seen as an effective AIC "booster," and this is precisely why 
the ensemble approach to variable selection is valuable and powerful. 



1.3 Motivating example: A weak signal 

One of the key objectives of this article is to present a better variable-selection e nsemble than what 
PGA produces, one which we call the stochastic stepwise ensemble (ST2E). Like lZhu and Chipman 
(|2006l ). we also focus on multiple linear regression models, although VSEs are easily applicable to 
other statistical models such as logistic regression and Cox regression. 
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We first describe a simple experiment to motivate our work. There are 20 potential predictors, 
xi,X2, ...,X2o, but only three of them are actually used in the true model to generate the response, 

y: 

y = axi + 2x2 + 3x3 + ere, Xi, X20, e ~ N(0, 1). (3) 

The sample size n is taken to be 100, and a — i. In addition, the three variables that generate y 
are correlated, with Corr(xi,Xj) — 0.7 for i,j G {1,2,3} and i ^ j. The x's and e are otherwise 
independent of each other. 

We shall consider a — 0.1,0.2, ...,0.9, 1.0, 1.2 and 1.5. By construction, there are three types of 
variables: xi is a relatively weak variable — it is part of the true model but its signal-to-noise ratio 
is relatively low; X2 and X3 are relatively strong variables; and X4, X20 are noise variables — they 
are not part of the true model. 

For each a, the experiment is repeated 100 times. Figure [T] shows the average frequency the three 
different types of variables are selected by two different VSEs, PGA and ST2E, as our experimental 
parameter a varies. The two VSEs are both of size B = 300. 

The messages from this experiment are as follows. In terms of catching the strong signals (x2 
and X3), ST2E and PGA are about the same. In terms of guarding against the noise variables (xj for 
j > 3), ST2E is slightly better than PGA. But, most importantly, we see that ST2E is significantly 
better than PGA at catching the weak signal, xi. It is in this sense that ST2E is a better VSE than 
PGA. 

The improved performance of ST2E is due to the use of a more structured stochastic optimizer 
in each path. In particular, a so-called stochastic stepwise (STST or simply ST2) algorithm is used, 
which is why we call it the "stochastic stepwise ensemble" (or ST2E). Explanations for why a more 
structured stochastic optimizer is desirable are given in Section [2.3.11 



Pr{Xj included), j=1 - Weak Signal Pr(Xj included), j=2,3 - Strong Signal Pr(Xj included), i>3 - Noise 




0.0 0.5 1.0 1.5 0.0 0.5 1.0 1.5 0.0 0.5 1.0 1.5 



Figure 1: Motivating example (Section ll.3|) . Average frequency the three different types of variables 
are selected by ST2E and PGA. Notice that the vertical scales are not identical. 
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1.4 Random Lasso and stability selection 



The idea of using the ensemble approach for vari able selection has started to catch up in recent 
years, fo r example, the "random Lasso" in ethod (jWang et a^.l 120091) and the "stability selection" 
method (IMeinshausen and Biihlmannl 120101) . The latter consists of a general class of methods for 
structural estimation, including graphical modeling, but we shall limit our discussions here to the 
regression problem. 

For regression problems, these algorithms essentially give rise to different VSEs as we have defined 
them in Section II. 1[ although they are not explicitly presented or labeled as ensemble algorithms 
by the authors. The "random Lasso" method, for instance, was presented to "alleviate [various] 
limitations of [the] Lasso" when dealing with highly correlated variables and "large p, small n" 
problems. The "stability selection" method, on the other hand, was presented to reduce the influence 
of regularization parameters and to provide "finite sample familywise error control" for the expected 
number of false discoveries. We shall see later that, while this method does provide very good error 
control for false discoveries, it appears to do so at the expense of missing true signals. 

Even though one can apply sta bility selection to different regression procedures, we will consider 
only its application to the Lasso (jTibshiranil 1 1 9961 ) . At a very high level, both the random Lasso 
and the stability selection algorithms consist of running a randomized Lasso repeatedly on many 
bootstrap samples and aggregating the resu lts. That is. they c an both be regarded as VSEs. They 
differ on how the Lasso is randomized. For lWang et ali (j2009l ). each Lasso is run with a randomly 



selected subset of variables; both the size of this subset and the 1 1 -re gularization parameter A are 



Meinshausen and Biihlmann 



selecte d by cross validation (CV) and fixed at the CV choices. For 
(|2010l ). all regularization parameters A > X„iin are considered; each Lasso is run by randomly scaling 
the regularization parameter for every regression coefficient; and the parameter Xmin is chosen to 
control the expected number of false discoveries. 



1.5 Outline 

We proceed as follows. In Section [2l we describe a more structured approach to generate VSEs, 
the ST2 algorithm. We explain why the ST2 algorithm produces better ensembles than the genetic 
algorithm. We also describe how to set the tuning parameter in ST2. In Section [3l we present a 
number of simulated and real-data examples, and c ompare ST2E with a n umber of other variable 



selection algo rithms, such as La sso (ITibshirani 1996 ) and i ts variations (e.g 



20071). LARS (lEfron et aLll2004D . SCAD (|Fan and Lill200lD . elastic net (IZou and Hastiell2005D . VISA 



Zou 



2006 



Meinshausen 



( Radchenko and Jamesll2008r) . and the two e nsemble approaches mentioned abo ve — random Lasso 
(jWang et aLl 120091 ) and stability selection (jMeinshausen and Biihlmannl 120101 ). In Section lU we 
make a few more general comments about the ensemble approach for variable selection, and present 
a simple extension of ST2E to tackle "large p, small n" problems. 
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2 Stochastic stepwise selection 



In this section, we describe a more structured stochastic search algorithm suitable for variable 
selection. In section [2.3.11 some explanations will be given as to why a more structured stochastic 
search is desirable. 

2.1 The ST2 algorithm 

Traditional stepwise regression combines forward and backward selection, alternating between for- 
ward and backward steps. In the forward step, each variable other than those already included is 
added to the current model, one at a time, and the one that can best improve the objective function, 
e.g., the AIC, is retained. In the backward step, each variable already included is deleted from the 
current model, one at a time, and the one that can best improve the objective function is discarded. 
The algorithm continues until no improvement can be made by either the forward or the backward 
step. 

Instead of adding or deleting variables one at a time, ST2 adds or deletes a group of variables 
at a time, where the group size is randomly decided. In traditional stepwise, the group size is one 
and each candidate variable is assessed. When the group size is larger than one, as is often the case 
for ST2, the total number of variable groups can be quite large. Instead of evaluating all possible 
groups, only a randomly selected few are assessed and the best one chosen. Table [1] contains a 
detailed description of the ST2 algorithm. 

2.2 Tuning functions 

We now explain how the numbers gf,g{,,kf,ki, (see Table [T]) are determined. Suppose we are doing 
a forward (backward) step and the potential predictors to be added (deleted) are {xi,X2, ■■■,Xm}- 
First, we need to determine g, the number of variables to add (delete). Intuitively, it seems reasonable 
that g should depend on m, say 

g = <f>g{m). 

Second, given g, we have a total of (™) possible groups of variables and need to determine fc, the 
number of groups to assess. Intuitively, it also seems reasonable that k should depend on (™) , say 

k = (pk{in,g). 

We define the function (pg as 

0g(m) ~ Unif(*„0, 

where 

= {1,2,3,..., [Am + 0.5J}, 
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Table 1: The stochastic stepwise (STST or simply ST2) algorithm for variable selection. 
Repeat 

1. (Forward Step) Suppose d variables, {xi-^, xi^, xi^} are not in the current model. 
Initially, d — p. 

(a) Determine the number of variables we want to add into the model, or the group 
size, say gf < d. t 

(b) Determine the number of candidate groups that will be assessed, kf. ^ 

(c) Generate kf candidate groups of size <?/, each by randomly choosing gf variables 
without replacement from the set, {xi^^jXi^, ...,xi^}. 

(d) Assess each candidate group by adding it into the current model, one group at 
a time. The one that can best improve the objective function (e.g., the AIC) is 
added into the model. 

2. (Backward Step) Suppose h variables, {xi-^jXi^, ...,xi,^} are in the current model. 

(a) Determine the number of variables we want to delete from the model, or the 
group size, say gb < h. ^ 

(b) Determine the number of candidate groups that will be assessed, kb- ^ 

(c) Generate kb candidate groups of size gb, each by randomly choosing gb variables 
without replacement from the set, {xi-^,xi2, ■■■,xii^}. 

(d) Assess each candidate group by deleting it from the current model, one group 
at a time. The one that can best improve the objective function (e.g., the AIC) 
is deleted from the model. 

Until 

no improvement can be made by either the forward or the backward step. 



Details for how the numbers gf,kf,gi,,ki, axe determined are given in Sections 12.21 and 12.31 
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for some < A < 1, and [a;J is the largest integer not greater than x. The function (pk is defined as: 

1/k 



<Pk{m,g) 



0.5 



(4) 



for some n> 1. We fix A = 1/2 and discuss how to choose the parameter k later (Section l2.3|) . Here, 
we first explain why the functions (j)g and (j)^ are chosen to have these particular forms, and why we 
fix A = 1/2. 

First, it is important that g = (j)g{m) is a stochastic and not a deterministic function. Consider 
the first forward step and the first backward step. Suppose there are m — p = 20 potential predictors. 
If 4>g{m) is a deterministic function, say (j)g{20) — 7 and 4>g{7) — 3, then, for all individual paths, 
only models with 7 predictors are assessed by the forward step and those with 7 — 3 = 4 predictors 
are assessed by the backward step. Many models, such as those with 2 or 3 predictors, are never 
assessed. Clearly, more flexibility is needed. According to our definition and with A = 1/2, 0g(2O) ~ 
Unif{l, 2, 10}. In the first forward step, some paths may add 3 variables, while others may add 
1, 2, 4, 5, or 10 variables. 

Next, it makes sense that we should not add or delete too many variables in a single step. This 
is why we fix A = 1/2, so that at the most half of the available variables can be added or deleted. 
Notice that this restriction only applies to single search steps; it does not preclude the algorithm 
(which consists of many such steps) from selecting all the variables if necessary. 

Finally, we want the function k = (j)k{m,g) to be monotonically increasing in (™) — as more 
subsets become available, more candidate groups should be assessed in order to have a reasonable 
chance of finding an improvement. However, we cannot afford to let this grow linearly in since 



it can be a very large number and that's why we use 



1 1/« 



for some k > 1. 



2.3 Tuning parameters 

We now explain how to choose the tuning parameter, k. For prediction ensembles, cross validation 
can be used to adjust various tuning parameters in order to maximize prediction accuracy, but 
this requires our validation data to contain the right answer. For variable-selection ensembles, 
however, this is not viable. We cannot empirically adjust its tuning parameter(s) to maximize 
selection accuracy because, no matter how we set data aside for validation, we won't know which 
variables are in the "true model" and which ones are not. For this reason, many researchers still use 
cross- validated prediction e rror to guide the choice of tuning parameters for their variable-selection 
procedure, but lYansJ (|2005l) has clearly established that prediction accur a,cy and s e lection accuracy 
are fundamentally at odds with each other. In a more recent article, IShmuelil (|2010l ) discusses 
various related issues from a less technical and more philosophical point of view. It must be stated 
without ambiguity that we are aiming for selection accuracy, not prediction a ccuracy. As such, cross 
validation is out of the question. Instead, we use ideas from Isreiman (2001) to help us specify our 
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tuning parameter. 



2.3.1 Strength-diversity tradeoff 



BreimanI (|200ll ) studied prediction ensembles that he cahed random forests (RF): 



RF = {/(x; Ob) : Ot Vg, 6-1,2, B}. 



ad . 



For classification, /(x; dt,) is a classifier completely parameterized by 6b, and the statement "6h 
mean s that each /(• ; 9b) is generated using an iid stoc hastic me chanism, "Pe, e.g., bootstrap sampling 
(e-S. 



Breiman 



19961) and/or random subspaces (e.g.. lHdll998l ). 



BreimanI (|2001l ) proved that, for a random forest to be effective, individual members of the 



ensemble must be as good classifiers as possible, but they must also be as uncorrelated to each other 
as possible. In other words, a good classification ensemble is a diverse collection of strong classifiers. 

Typically, the diversity of an ensemble can be increased by increasing Var('P0). But, unfor- 
tunately, this almost always reduces its strength. Therefore, it is important to use a stochastic 
mechanis m Vg with a "reasonable" Var (Pg). This basic principle has been noted elsewhere, too. For 
example, iFriedman and Popescul (j2003() described ensembles from an importance-sampling point of 
view. There, the corresponding notion of Var(Pg) is simply the variance of the importance-sampling 
distribution, also referred to as the trial distribution. It is well-known in the im portance-s ampling 

Section 



Lh][2001l 



literature that the variance of the trial distribution must be specified carefully (e.g. 
2.5). 

Although Breiman's theory of strength-diversity tradeoff is developed for prediction ensembles, 
some of these ideas can be borrowed for VSEs. In fact, this tradeoff explains why ST2E produces 
better variable-selection ensembles than PGA. Recall that PGA uses the genetic algorithm as the 
main stochastic mechanism to produce the ensemble, whereas ST2E uses a more structured ST2 
algorithm (Section 12.11) . Using Breiman's language, we can say that the more structured ST2 
algorithm has a "more reasonable" variance than the genetic algorithm. It also allows us to exercise 
more control over the algorithm via the tuning parameter k in order to better balance the intricate 
tradeoff between strength and diversity. 



2.3.2 Computable measures of strength and diversity 

We now describe how to use the diversity-strength tradeoff to specify the tuning parameter, k. Given 
p potential covariates, let E be a VSE of size B. The idea is to define computable measures of its 
diversity and strength, say X'(E) and 5(E), and choose k to balance the tradeoff between them. 

Given a VSE, E, its diversity X'(E) can be measured by the average within-ensemble variation. 
For every variable j, there are B independent measures of how important the variable is. The 
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quantity 



-I 2 



1 



5] E(6,j)-7^Ee(^J) 



(5) 



B - 1 



b=l L b=l 



is the within-ensemble variance of these measures, and the quantity 



(6) 



is a measure of the average within-ensemble variation. 

Let F{-) denote the objective function that each path of the VSE aims to optimize. We measure 
the mean strength of E by the average percent improvement of the objective function over the null 
model, i.e., 



where Fq is the objective function evaluated at the null model, i.e., a model that does not contain 
any predictors. 

2.3.3 Example 

Figure [5] shows an example of how our tuning strategy works. This is based on 50 simulations using 
equation ([3]), while fixing a — 1. The quantities iS(E) and X'(E) are plotted against k (left and 
middle). For k, the logarithmic scale is used. Variable-selection performance, measured here by 



where "ASF" stands for the average selection frequency, is also plotted (right). 

The behavior depicted here is fairly typical. We see from the left panel that iS(E) tends to 
decrease as we increase k. This is because, when k is relatively small, the search conducted by 
steps 1(d) and 2(d) in the ST2 algorithm (Tabled]) is relatively greedy; many candidate subsets (of 
a certain given size) are evaluated and the best one, chosen. This results in high strength. As k 
increases, the search becomes less greedy, which reduces strength. 

On the other hand, the greedier the search, the higher the chance of finding the same subset. 
This explains why a small k tends to produce low diversity. As k increases and the search becomes 
less greedy, diversity starts to increase. But the parameter k controls not only the greediness but 
also the scope of the search. For example, it is easy to see from (j4]) that (j)k{m,g) — > 1 as k — > oo. 
This means that, when k is very large, the scope of the search performed by steps 1(d) and 2(d) 
becomes quite limited, which also reduces diversity. This is why we see that, in the middle panel, 
the diversity measure 2?(E) first increases but eventually decreases with k. 




B 



(7) 



Perf(E) = ASF(Signal) - ASF(Noise), 
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strength, S(E) 



Diversity, D(E) 



Performance, Perf(E) 




log(K) log(K) log(K) 



Figure 2: Illustration of how to select the tuning parameter, k, in ST2E. Based on 50 simulations 
of model ([3]) with a = 1. 

Finally, we see from the right panel that, as I?(E) reaches its peak level, the variable-selection 
performance also starts to level off or drop. This means choosing k by looking for the peak in the 
I?(E) plot can be an effective strategy. This is what we use in all the experiments we report below. 

Of course, we must emphasize that the measure Perf(E) and hence the right panel of Figure [2] 
are typically not available; they are only available for simulated examples where the true model is 
known. We include them here merely as a validation that it makes sense to choose k in such a way. 
In reality, one must rely on the plot of alone to make the choice. 

2.4 Effect of sample size 

Before we move on to examples and performance comparisons, we conduct a simple experiment to 
examine the performance of ST2E as the sample size n increases. As in Section [2.3.31 we simulate 
from our motivating example, equation ([3]), except wc fix a = 0.15 this time. Using a small a 
creates a more difficult problem, which will allow us to see the effect of n more clearly. For each 
n = 50, 100, 150, 250 and 500, we perform 100 simulations and record the average number of times 
the three types of variables (strong, weak, noise) are selected. Table [2] shows that ST2E behaves 
"reasonably" in this respect. As n increases, signal variables (both strong and weak) are selected 
with increasing probability, and the opposite is true for noise variables. 



3 Examples 

We now present a few examples and compare the performance of ST2E with a number of other 
methods. For VSEs (ST2E, PGA, and stability selection), we use a siz e of B = 300. To run stability 
selecti on, two tuning parameters are required, '^thr and Xrnin SCC 



Meinshausen and Biihlmann 



POIOT ) for details. The authors suggested fixing cither one at a "sensible" value and choosing the 
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Table 2: Average number of times (out of 100) the three types of variables — weak signal {j — 1), 
strong signal (j — 2,3), and noise (j > 3) — are selected by ST2E as the sample size n varies. Same 
setting as the motivating example in Section FOl with a = 0.15. 



Sample Size 
(n) 


Xj £ Weak Signal 
ij = 1) 


Xj G Strong Signal 
(j-2,3) 


Xj G Noise 
U > 3) 


50 


62 


99 


17.12 


100 


86 


100 


15.41 


150 


89 


100 


13.18 


250 


97 


100 


11.82 


500 


99 


100 


11.53 



other to control the expected number of false discoveries. We fix nthr ~ 0.6, and report results for 
a range of Xmin values. 



3.1 A widely used benchmark 

First, we look at a widely used benchmark simulation. There are p — 8 variables, Xi, ...,X8, each 
generated from the standard normal. Furthermore, the variables are generated to be correlated, 
with p(xi,Xj) = 0.5l'~^l for all i ^ j. The response y is generated by 



y = 3xi + 1.5x2 + 2x5 + ere 



(8) 



where e ^ N(0, 1). That is, only three vari a bles a re true signals; the remaining five are noise. 
This benchmark was first used bv iTibshiranil (|1996[ ). but it has been used by almost every major 
variable-selectio n paper pubhshed e ver since . 

For example. iFan and Li ( 200ll Example 4.1) used this si mulation to c ompare Lasso, the non- 



Miller 



2002), and SCAD 



negat ive garrote (lBreimaij|l995), hard threshold in g (see, e.g., lDonohdll995l) . best subset regression 



( Meinshausen 



Results from 



20071). VISA (IRadchenko and Jamedl2008l) . and random Lasso (|Wang et a^.l 120091 ) 



Fan and Lil 20011) . IWang et ali (120091 Example 1) us ed this Simula- 



(e.g., 

tion to compare La sso, ada ptive Lasso (IZoull2006l). elasti c net (jZou and Hastie 



2005). relaxed Lasso 



Fan and Lil (|200ll) are replicated in Table [31 together with results from three VSEs 



— ST2E, PGA, and stability selection. The ensemble algorithms are clearly among the most com- 
petitive. PGA and stability selection using a relatively large Xmin are slightly better than ST2E 
at excluding noise variables, but, as will become clearer in Table |4j they are also more likely than 
ST2E to miss tr ue signals. 



(|2009[ ) are replicated in TableU together with results from ST2E, PGA, 
and stability selection. Here, the difficulty of PGA becomes clearer. It is better than ST2E in terms 
of excluding noise variables, but it also misses true signals more often. The same can be said about 
stability selection using a relatively large Xmin value. It controls the number of false discoveries 
quite effectively, but we see that this will cause the method to behave poorly in terms of catching 
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Table 3: Widely-used benchmark fSection I3.1[) . Average number of zero coefficients for the noise 
group (oracle result — 5) and the signal group (oracle result = 0), based o n 100 simulat i ons. R esults 

Table 



other than those for ST2E, PGA and stability selection are taken from iFan and Lil ()2001 . 
1). SCADl uses cross-vali dation to sel e ct the tuning parameter, while SCAD2 uses a fixed tuning 
parameter — details are in lFan and l1 (|2001|) . 
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3.56 


0.00 


Hard 


4.02 


0.00 


Best subset 


4.73 


0.00 


Garrote 


3.38 


0.00 


Oracle 


5.00 


0.00 


ST2E 


4.81 


0.00 


PGA 


4.96 


0.00 


Stability selection 






•^mm — 1-5 


5.00 


0.21 


•^mm — 1-0 


5.00 


0.01 




4.95 


0.00 
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true signals. In order to improve its false negative rates, we must reduce Xmim but this necessarily 
allows for more false discoveries. Random Lasso has the best ability to catch true signals, but to do 
so, it makes a large number of false discoveries at the same time. 



3.2 Highly-correlated predictors 



Next, we look at a simulation (jWang et a/j|2009l Example 4) that is specifically created to study 
variable selection algorithms when the predictors are highly correlated and their coefficients have 
opposite signs. There are p — 40 variables, Xi, ...,X4o, each generated from the standard normal. 
The response y is generated by 



y = 3xi + 3x2 — 2x3 + 3x4 + 3x5 — 2x6 + o'e 



(9) 



where e ^ N(0, 1) and (7 = 6. In addition, the 40 variables are generated to have the following (block 
diagonal) correlation structure: 



— C3x3 



where C — 



1 0.9 0.9 
0.9 1 0.9 
0.9 0.9 1 



That is, there are three groups of variables: Vi = {1,2,3}, V2 = {4,5,6}, and V3 = {7, 8, 40}. 
The first two groups, Vi and V2, are true signals, but, within each of Vi and V2, the variables are 
highly correlate d. There is n o betw een-group correlation. 

(|2009[ ) are replicated in Table [3 together with results from the ST2E, 



Results from I Wang et al. 

PGA, and stability selection. Clearly, ST2E and random Lasso are the top two performers for this 
problem. PGA has a much higher tendency than ST2E to miss true signals. Stability selection 
appears to have some difficulties here as well. If we choose Xmin to control the false discoveries too 
aggressively, we start to lose signals quite badly. When its false discovery rate is as low as that of 
relaxed Lasso and VISA, for example, stability selection misses true signals much more often. On 
the other hand, even if Xmm is relaxed to allow for many false discoveries in this case, stability 
selection still cannot seem to catch the signals as well as ST2E or random Lasso. This example 
clearly demonstrates that, when the predictors are highly correlated, the performances of PGA and 
stability selection can start to deteriorate quite significantly, whereas ST2E produces a much more 
robust VSE. 



3.3 Diabetes data 

Finally, we analyze the "d iabetes" data set, which was used as the main example in the "least angle 



regression" (LAR) paper (jEfron et al\\2004) . This is a real data set; a standardized version of the 
data set is available as part of the R package, lars. There are n — 442 diabetes patients and p — 10 
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Table 4: Widely- used benchmark (Section 13.11) . Minimal, median, and maximal number of times 
(out of 100 simulations) different types of variables (signal versus noise) are s e lected . Results other 
than those for ST2E, PGA and stability selection are taken from lWang et MI (|2009l Table 2). 







Xj e Signal 




Xo e Noise 








(i = 1,2,5) 




[j = 3, 


4,6,7, 


8) 


Method 


Min 


Median 


Max 


Min Median 


Max 


n = 50, (7=1 














Lasso 


100 


100 


100 


46 


58 


64 


Adaptive Lasso 


100 


100 


100 


23 


27 


38 


Elastic Net 


100 


100 


100 


46 


59 


64 


Relaxed Lasso 


100 


100 


100 


10 


15 


19 


VISA 


100 


100 


100 


11 


17 


20 


Random Lasso 


100 


100 


100 


28 


33 


44 


ST2E 


100 


100 


100 


1 


1 


8 


PGA 


100 


100 


100 





2 


6 


Stability selection 














^min — 1-5 


75 


86 


100 








2 


^min 1-0 


100 


100 


100 








2 


Amm — 0.5 


100 


100 


100 








7 


n = 50, (7 = 3 














Lasso 


99 


100 


100 


48 


55 


61 


Adaptive Lasso 


95 


99 


100 


33 


40 


48 


Elastic Net 


100 


100 


100 


44 


55 


69 


Relaxed Lasso 


93 


100 


100 


11 


18 


21 


VISA 


97 


100 


100 


15 


21 


24 


Random Lasso 


99 


100 


100 


45 


57 


68 


ST2E 


89 


96 


100 


4 


12 


20 


PGA 


82 


98 


100 


4 


7 


11 


Stability selection 














Amm — 1.5 


59 


64 


100 








3 


Amm — 1.0 


81 


83 


100 





2 


9 


Amm — 0.5 


90 


98 


100 


4 


8 


22 


n = 50, (7 = 6 














Lasso 


76 


85 


99 


47 


49 


53 


Adaptive Lasso 


62 


76 


96 


32 


36 


38 


Elastic Net 


85 


92 


100 


43 


51 


70 


Relaxed Lasso 


60 


70 


98 


15 


19 


21 


VISA 


61 


72 


98 


15 


19 


24 


Random Lasso 


92 


94 


100 


40 


48 


58 


ST2E 


68 


69 


96 


9 


13 


21 


PGA 


54 


76 


94 


9 


14 


16 


Stability selection 














^min — 1-5 


40 


41 


83 





4 


8 


^min — 1-0 


59 


61 


92 


4 


8 


18 


^min — 0.5 


76 


84 


100 


30 


42 


50 



15 



Table 5: Highly-correlated predictors (Section l3.2p . Minimal, median, and maximal number of times 
(out of 100 simulations) different types of variables (signal versus noise) are se l ected . Results other 
than those for ST2E, PGA, and stability selection are taken from lWang et al. I ()2009l Table 2). 



IVlcljliOQ 


ivim 


■V" ■ CZ "^1 mn o 1 

[J ^ ^) 

MedicLii 


A/To V 
iViaX 


iviin 


t i N Ulbt: 

\J > 
Mcd-ia-n 


iVldX 


n — ou 














Lasso 


1 1 
li 


( u 


i i 


1 


1 7 


ZD 


-TiU-Cip 11 Vt, IJClobU 


1 f\ 






A 


Q 
O 


1 A 


H/lcLStlC iMcC 


oo 


Q9 


yo 


Q 


1 7 


zo 


Relaxed- La.sso 


A 


Do 


70 




u 


A 


Q 

y 


VISA 


4 


62 


73 


1 


3 


8 


Random Lasso 


84 


96 


97 


11 


21 


30 


ST2E 


85 


96 


100 


18 


25 


34 


PGA 


55 


87 


90 


14 


23 


32 


Stability selection 














^min — 0-1 


1 


35 


42 


1 


5 


13 


^min — 0.01 


1 


37 


45 


7 


13 


22 


A„„„ = 0.002 


1 


40 


52 


31 


42 


54 


n = 100 














Lasso 


8 


84 


88 


12 


22 


31 


Adaptive Lasso 


17 


62 


72 


4 


10 


14 


Elastic Net 


70 


98 


99 


7 


14 


21 


Relaxed Lasso 


3 


75 


84 


1 


3 


8 


VISA 


3 


76 


85 


1 


4 


9 


Random Lasso 


89 


99 


99 


8 


14 


21 


ST2E 


93 


100 


100 


14 


21 


27 


PGA 


40 


85 


92 


13 


22 


33 


Stability selection 














Amm — 0.5 


1 


67 


73 


3 


8 


13 


Amm — 0.3 


2 


69 


75 


13 


26 


32 


^min — 0-2 


3 


71 


78 


60 


72 


78 
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variables, such as age, sex, body mass index, and so on. The response is a measure of disease 
progression. 

Figure 13] shows resuhs from both LAR and ST2E. For LAR, the entire solution paths are displayed 
for all the variables. As the penalty size decreases, the variables enter the model sequentially. The 
order in which they enter the model is listed in Table [HI For ST2E, the variable importance measure 
^ are plotted for each variable. The order in which these variables are ranked by ST2E is also 
listed in Tabled 

LAR ST2E 



bmi 

itg 

map 

tc 

sex 

Idl hdl 

glu 

tch 

age 

a!o 02 04 06 08 7^ 2 4 6 8 10 

|beta|/max|beta| Variables 

Figure 3: Diabetes data. Left: Results from least angle regression. Right: Results from ST2E. 

According to Table HI LAR and ST2E agree that "bmi" , "Itg" , and "map" are the top three 
variables, whereas "age" is the least important variable. For the intermediate variables, LAR and 
ST2E seem to disagree on their relative importance. For example, the variable "Idl" is the last one 
to be entered into the model by LAR before the variable "age" , an indication that it is perhaps not 
an important variable, whereas ST2E ranks it in the middle. 

Upon closer examination, however, we can see from Figure [3] that, once "Idl" is in the model, it 
actually gets a relatively large coefficient, larger than some of the other variables that were entered 
earlier — this can almost certainly be attributed to "Idl" being highly correlated with these other 
variables. Therefore, there is good reason why ST2E does not rank "Idl" close to the bottom. Similar 
statements can be made for "tc" , "hdl" , and "sex" . 
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Table 6: Diabetes data. 



Method 


Ordering and Ranking of Variables 
(top — > bottom) 


LAR 
ST2E 


bmi Itg map hdl sex glu tc tch Idl age 
bmi Itg map tc sex Idl hdl glu tch age 



4 Discussions 

Before we end, we would like discuss a few important issues. 

4.1 Ranking versus thresholding 

The variable importance measure ([1} is a particularly nice feature for the ensemble approach. Using 
an ensemble approach, variable selection is performed in two steps: ranking and thresholding. We 
first rank the variables, e.g., by equation ([TJ, and then use a thresholding rule to make the selection, 
e.g., equation ([2]). 

As proponents of the ensemble approach, we are of the opinion that the task of ranking is the 
more fundamental of the two. From a decision-theoretic point of view, once the variables are ranked, 
the choice of the decision threshold has more to do with one's prior belief of how sparse the model 
is likely to be. 

We also think that variable selection per se is not quite the right objective, whereas variable 
ranking is. Imagine the problem of searching for a few biomarkers that are associated with a certain 
disease. What type of answer is more useful to the medical doctor? Telling her that you think it is 
biomarkers A, B, and C that are associated? Or giving her a ranked list of the biomarkers? Such a 
list is precisely what the ensemble approach aims to provide. 

Therefore, we would have preferred not to introduce any thresholding rule at all. But, in order 
to compare with other methods in the literature, it is not enough to just rank the variables; we 
must make an active selection decision. In fact, because experiments are typically repeated multiple 
times, we must use a thresholding rule that is automatic, such as equation ([2]). However, it is not 
our intention to take this thresholding rule too seriously; it is only introduced so that we could 
produce the same type of results as everybody else on various benchmark experiments. As far as 
we are concerned, the output of variable-selection ensembles is the variable importance measure ([1}, 
which can be used to rank potential variables. 

4.2 Stochastic generating mechanisms 

It is clear from our definition (Section II. ip that there can be many ways to construct VSEs. To do 
so, we must employ a stochastic mechanism so that we can repeatedly perform traditional variable 
selection and obtain slightly different answers each time. One way to achieve this is to use a stochastic 
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Table 7: Stochastic generating mechanisms for different VSEs. 



VSE 


Generating Mechanism 


PGA 
ST2E 

Random Lasso 
Stability Selection 


genetic algorithm 

stochastic stepwise search algorithm 
bootstrap + random subsets 

bootstrap + random scaling of regularization parameter 



rathe r than a deterministic search algorithm to perform the selection, e.g., PGA (|Zhu and Chipman 
20061 ) and ST2E (t his paper). Anot her way is to perform t he selection on bootstrap samples , 
e.g., random Lasso (jWang et a/.l 120091 ) and stability selection (jMeinshausen and Biihlmannl 120101 ). 
According to our own empirical experiences (not reported in this paper) , bootstrapping alone usually 
does not generate enough diversity within the ensemble to give satisfactory performance. This 
explains why extra randomization is employed by both the random Lasso and the stability selection 
methods; see Section [L4l Table [7] summarizes the stochastic mechanisms used to generate different 
VSEs. 

This leads us to a very natural question: what makes a good stochastic mechanism for generating 
VSEs? This question will likely take some time to answer; we certainly don't have an answer at the 
moment. What we have shown in this paper is the following: the ST2 algorithm appears to be quite 
an effective VSE-generating mechanism, and various ideas we have used to develop ST2E can help 
us think about this kind of questions in a more systematic manner. 



4.3 "Large p, small n" problems 

We now describe a simple extension that allows ST2E to tackle "large p, small n" problems. To do so, 
we simply insert a p re-screening step b efore running each ST2 path by performing "sure independence 
screening" or SIS (|Fan and Lvl 120081 ) on a bootstrap sample to pre-select q < n variables, e.g., 
q — n — 1. The ST2 algorithm is then applied to this subset of q variables. Notice that, while this 
imposes an upper limit on how many variables each ST2 path can include, it by no means restricts 
the number of variables the ST2E ensemble can includ e as a whole. 

To test this strategy, we use another simulation from I Wang et ali (|2009l Example 5) with p = 120 
and n — 50, designed specifically as a test case for p > n problems. The 120 covariates are generated 
from a multivariate normal distribution with mean zero and covariance matrix, 

^30x30 _ _ — 



^30x30 

J30X30 



J30X30 

^30x30 



-'30x30 
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where 




and 



3, J = 0.2 



V 



The first 60 coefficients are generated from N(3, 0.5) and then fixed for all simulations; the remaining 
60 coefficients are equal to zero. We are able to obtain and us e exactly the same set of 60 non-zero 



coefficients and the noise level for e (cr = 50) as those used bv lWang et ali (|2009f) 



Table 8: A "large p, small n" problem (Section 14. 3p . Minimal, median, and maximal number of 
times (out of 100 simulations) different types of variables (signal versus noise ) are s elected. Results 
other than those for ST2E and stability selection are taken from IWang et MI (l2009l Table 2). 







Xo G Signal 






Xj G Noise 




(j 


= 1,2, ...,60) 


(J-- 


= 61,62,... 


,120) 


Method 


Min 


Median 


Max 


Min 


Median 


Max 


Lasso 


19 


30 


40 


3 


8 


14 


Adaptive Lasso 


15 


25 


35 





7 


11 


Elastic Net 


40 


50 


61 


1 


5 


8 


Relaxed Lasso 


14 


23 


34 





3 


8 


VISA 


16 


27 


35 





2 


8 


Random Lasso 


76 


86 


95 


18 


29 


38 


ST2E (with SIS) 


81 


88 


95 





1 


5 


Stability selection 














X — in-2 

^min — 


4 


22 


52 





3 


10 


^rnin — 10 


13 


40 


80 


1 


6 


25 


^Tnin — 10 


73 


95 


100 


16 


38 


80 



Table [8] shows the results. Again, random Lasso is competitive in terms of catching the signals, 
but it makes a large number of false discoveries. Stability selection has the same difficulty as before 
— it misses signals when large values of A„im are used to regulate false discoveries, and its ability 
to catch the signals can match that of ST2E and random Lasso only if Xmin is set to be so low as 
to tolerate a very large number of false discoveries. 

Readers will find that the performance of ST2E (with SIS) on this p > n example is quite remark- 
able. This is because the extra pre-selection step really makes it an entirely different VSE altogether, 
since the stochastic generating mechanism (see Section IT^ and Table [7]) has fundamentally changed. 
Intuitively, it is easy to see that SIS can improve ensemble strength by screening out noise variables, 
while doing so on bootstrap samples can further enhance ensemble diversity. This reinforces the 
points we have made earlier in Section |4?2] — the question of how to design and/or characterize a 
good stochastic mechanism for generating VSEs is a very intriguing one indeed, and thinking about 
the strength and diversity of the ensemble often gives us some useful insights. 
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4.4 False positives versus false negatives 

Another striking phenomenon that we can observe from all our experiments is the extremely delicate 
balance between false positive and false negative rates in variable selection problems. It is very 
hard to reduce one without significantly affecting the other. For VSEs, the underlying stochastic 
generating mechanism is, again, critical. Among the four VSEs listed in Table [3 PGA seems 
to "care" more about false positives, whereas random Lasso appears to "care" more about false 
negatives. Stability selection allows users to control the number of false positives through a tuning 
parameter, but, as our experiments have shown, aggressive control of false positive rates necessarily 
leads to very poor false negative rates, and vice versa. 

Overall, ST2E appears to balance the two objectives nicely, and this is precisely why we think 
it is a valuable practical algorithm. But, of course, there is no reason to believe why one cannot 
find another generating mechanism to produce a VSE that will balance the two objectives even 
better. However, as we've alluded to earlier, the more interesting question is not whether we can 
find another mechanism, but how we can know that we have found a good one. This we leave to 
future research. 

5 Summary 

We are now ready to summarize the main contributions of this paper. First, we gave a formal and gen- 
eral description of the ensemble approach for variable selection. Next, we pointed out that Breiman's 
theory for prediction ensembles — in particular, the tradeoff between diversity and strength — is 
useful in guiding the development of variable-selection ensembles as well. Finally, we used a more 
structured stochastic mechanism, the ST2 algorithm, to construct a better variable-selection ensem- 
ble, ST2E, which we demonstrated to be more robust than other VSEs, and competitive against 
many state-of-the-art algorithms. 

References 

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 

Second International Symposium on Information Theory, pages 267-281. 

Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37, 
373-384. 

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140. 
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. 

Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transactions on Information Theory, 
41(3), 613-627. 



21 



Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least Angle Regression (with 
discussion). The Annals of Statistics, 32(2), 407-499. 

Fan, J. and Li, R. (2001). Variable selection via nonconcavc penalized likelihood and its oracle 
properties. Journal of the Am,erica,n Statistical Association, 96(456), 1348-1360. 

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with 
discussion). Journal of the Royal Statistical Society Series B, 70, 849-911. 

Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Machine 
Learning: Proceedings of the Thirteenth International Conference, pages 148-156, San Francisco. 
Morgan Kauffman. 

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals 
of Statistics, 29(5), 1189-1232. 

Friedman, J. H. and Popescu, B. (2003). Importance-sampled learning ensembles. Technical report, 
Stanford University. 

Friedman, J. H., Hastie, T. J., and Tibshirani, R. J. (2000). Additive logistic regression: A statistical 
view of boosting (with discussion). The Annals of Statistics, 28(2), 337-407. 

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. 
Addison- Wesley, Reading, MA, USA. 

Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions 
on Pattern Analysis and Machine Intelligence, 20(8), 832-844. 

Jolliffe, I. T. (2002). Principal Component Analysis. Springer- Verlag, 2nd edition. 

Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer- Verlag. 

Meinshausen, N. (2007). Relaxed Lasso. Computational Statistics & Data Analysis, 52, 374-393. 

Meinshausen, N. and Biihlmann, P. (2010). Stability selection (with discussion). Journal of the 
Royal Statistical Society Series B, 72, 417-473. 

Miller, A. J. (2002). Subset Selection in Regression. Chapman and Hall, 2nd edition. 

Radchenko, P. and James, G. (2008). Variable inclusion and shrinkage algorithms. Journal of the 
American Statistical Association, 103, 1304-1315. 

ShmueU, G. (2010). To explain or to predict? Statistical Science, 25(3), 289-310. 

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal 
Statistical Society Series B, 58, 267-288. 



22 



Wang, S., Nan, B., Rosset, S., and Zhu, J. (2009). Random Lasso. Manuscript under review. 

Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model indenti- 
fication and regression estimation. Biometrika, 92, 937-950. 

Zhu, M. (2008). Kernels and ensembles: Perspectives on statistical learning. The American Statis- 
tician, 62, 97-109. 

Zhu, M. and Chipman, H. A. (2006). Darwinian evolution in parallel universes: A parallel genetic 
algorithm for variable selection. Technometrics, 48, 491-502. 

Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical 
Association, 101, 1418-1429. 

Zou, H. and Hastie, T. J. (2005). Regularization and variable selection via the elastic net. Journal 
of the Royal Statistical Society Series B, 67, 301-320. 



23 



